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14 Abstract 

15 Background 

16 The identification of modules or communities of related variables is a key step in the 

17 analysis and modelling of biological systems. Many module identification procedures are 

18 available, but few of these can determine the module partitions best fitting a given dataset in 

19 the absence of previous information, in an unsupervised way, and when the links between 

20 variables have different weights. Here I propose such a procedure, which uses the stability 

21 under bootstrap resampling of different alternative module structures as a criterion to identify 

22 the structure best fitting to a set of variables. In its present implementation, the procedure 

23 uses linear correlations as link weights. 

24 Results 
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1 Computer simulations show that the procedure is useful for problems involving 

2 moderate numbers of variables, such as those commonly found in gene regulation cascades or 

3 metabolic pathways, and also that it can detect hierarchical network structures, in which 

4 modules are composed of smaller sub modules. The procedure becomes less practical as the 

5 number of variables increases, due to increases in processing time. 

6 Conclusions 

7 The proposed procedure may be a valuable and robust network analysis tool. Because it is 

8 based on comparing the amount of evidence for different module partitions structures, this 

9 procedure may detect the existence of hierarchical network structures. 
10 

11 
12 

13 Keywords: network structure, hierarchical communities, weighted networks, unsupervised 

14 clustering, modularity. 
15 

16 
17 

18 Background 

19 Complex systems are often modelled and analysed as networks of related elements 

20 (nodes) connected by edges representing the relationship between them [1]. In Biology, many 

21 of these networks show a modular structure: nodes can be grouped into communities or 

22 modules so that there is a dense web of edges among nodes in the same module and a thin 

23 one between nodes in different modules [2-5]. Such structure, or modularity, has been 

24 observed in gene expression networks [6, 7], protein-protein interactions [8], metabolic [9] 

25 [10] and developmental [11, 12] pathways, and species interactions in ecosystems [13, 14]. 
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1 

2 The building of models for the study of network structure, function, regulation or 

3 evolution may require the use of module identification (also called community detection) 

4 procedures. Many such procedures have been proposed. Some are confirmatory, requiring a 

5 prior knowledge of module demarcations or at least of the number of modules present (e.g., 

6 [15, 16]). Other procedures (e.g., [17-23]) are exploratory and unsupervised, making no 

7 assumptions about modules. They use a previously known set of edges between nodes to 

8 identify the partitions among nodes that maximize some criterion of modular structure. 

9 Typically, they do not consider variation in the strength of the links between different nodes, 

10 i.e., they are unweighted. This may of course entail a loss of relevant information, as the 

1 1 heterogeneity in edge weights may be fundamental for the understanding of the whole 

12 network [24]. There are finally exploratory, unsupervised procedures considering different 

13 weights for each edge. In this category are the procedures of Rosvall and Bergstrom [25], 

14 based on simulated annealing and Blondel et al. [26] based on the maximization of 

15 modularity, defined as the number of edges falling within modules minus the number 

16 expected if edges were placed at random [19]. These two procedures are fast and applicable 

17 to very large networks [27], but they do not take into account the degree of precision in the 

18 estimation of these edges' weights. This may be not critical when the focus is on the 

19 identification of large-scale patterns in big data sets, but might become an important 

20 limitation in smaller problems as those typically found in the analysis of gene regulation 

21 pathways, where the basic aim is to determine the location of variables into particular 

22 modules. In these situations it can be important to consider the robustness of module 

23 allocations, which would depend heavily on the precision in the estimation of edges weights 

24 [24]. 
25 
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1 Here I propose a new procedure combining a clustering algorithm with bootstrap 

2 resampling to identify modules of correlated variables measured in the same individuals. In 

3 cluster analysis terminology, these modules would be R-mode (because it is the variables, not 

4 the measured individuals that are being grouped [28]) variational (because the edges consist 

5 on correlations between the variables, which are represented as nodes [29]) clusters. The 

6 procedure takes into account that the correlations constituting the network edges may vary 

7 and their value may have been estimated with limited precision. I use computer simulation to 

8 show that it is superior to the procedure of Blondel et al. in the identification of variational 

9 modules in data sets with a moderate number of variables, and also that it can detect the 
10 existence of hierarchical module structures. 

11 

12 Implementation 

13 For an n-variables dataset, a clustering method (in the present implementation of the 

14 procedure, k-means clustering, based on the R kmeans function) is applied to obtain partitions 

15 into 2 to n-1 clusters. A vector of variables' coincidences c of length (n -n)/2 (i.e., the number 

16 of non-redundant pair wise combinations of variables) is obtained for each of these n-2 

17 cluster analyses, with values of 1 if the two corresponding variables were assigned to the 

18 same cluster in that analysis and of 0 otherwise. Now the stability of each of the n-2 analyses 

19 is tested by bootstrap resampling of the individuals' observations in the original dataset. For 

20 each resample, the above n-2 cluster analyses are done and the corresponding c vectors 

21 obtained. These vectors are then compared across bootstrap samples. 
22 

23 If a real, detectable module structure existed in the data, bootstrap-replicated cluster 

24 analyses considering the real number of modules-clusters would tend to allocate variables in 

25 the same clusters, so that the variance across resamples would be low for each element in c. 
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1 A given pair of variables would tend to be either in the same cluster, the corresponding c 

2 value being one in most of resamples, or in different clusters, the c tending to be zero. In 

3 analyses considering wrong numbers of clusters -or analyses of data with no community 

4 structure-, each bootstrap replicate would result in clusters containing random combinations 

5 of variables, and the c values variance across bootstraps would be higher. In the procedure 

6 proposed here, the variances between resamples are calculated for each element of c and 

7 number of clusters, and the number of cluster partitions with the minimum value for the sum 

8 of these (n-n)/2 variances (i.e., that resulting in the most stable c vectors) is selected as the 

9 best estimate of community structure in the original data. Figure 1 illustrates the basic 

10 framework for this approach. The sum of variances can be used to compare the results 

1 1 obtained for the different numbers of clusters. 
12 

13 It must be taken into account however that the distribution of this sum of variances is 

14 not independent of the number and size of clusters considered in the successive n-2 analyses. 

15 To correct for this effect, the sums are made relative to their expected values in a null 

16 situation with the same number of clusters and a lack of correlation between variables. The 

17 result is the variance criterion used below. The null situation values are obtained by 

18 randomizing the observed variable values independently across individuals. Thus, while the 

19 univariate distributions are maintained, any correlation between variables disappears. 
20 

21 I studied the performance of the proposed method in simulated datasets of grouped 

22 variables xy : 
23 

24 Xfj — C( H~ €{j 

25 
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1 where q was common to all x variables in module i and caused correlation among 

2 them, and ey was specific to each x. The considered datasets differed in number of variables, 

3 distribution of module sizes, total number of observations, correlation between variables in 

4 the same module and variables distributions (Table 1). 
5 

6 I studied the ability of the proposed method to detect hierarchical correlation 

7 structures (i.e., the presence of sub-modules within modules) by simulating datasets with 

8 variables: 
9 

10 x ijk = gi + Sij + e ijk 
11 

12 where gi, sy and are module, sub-module and variable-specific effects. 

13 

14 

15 Results and discussion 

16 In the non-hierarchical cases, the proposed procedure was able to identify the correct 

17 number of modules even for small size samples and moderate correlations between variables 

18 in the same module (Fig. 2). Thus, a sample size of 25 (Fig. 2e) was enough to easily identify 

19 two modules of four variables having a correlation of 0.375, and modules of variables having 

20 a correlation of 0.231 were easily detected using samples of size 100 (Fig. 2d). The 

21 performance of the procedure did not obviously depend on module number and size, the 

22 homogeneity of these sizes (Fig. 2g, 2h) or the variables' distributions (Fig. 2i, 2j). The less 

23 favourable situations were those with the lowest correlation within modules (0.167, Fig. 2c) 

24 and the lowest number of variables (four variables, Fig. 2k). In the latter case, the variance 
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1 criterion was clearly under the corresponding value for the null case, but the differences 

2 between the two and the three clusters solutions were very slight. 
3 

4 The proposed procedure detected was able to detect hierarchical modular structures, 

5 especially when the hierarchy was regular, i.e., the pattern of subdivision was the same in all 

6 clusters (Fig. 3a, 3d and 3e). These regular partitions appeared as local minima in the sum of 

7 variances profile: two and four clusters in Figure 3a; two, four and eight clusters in Figure 3d. 

8 The procedure failed in the case of four modules and eight sub-modules (Fig. 3e) in which 

9 there the second local minimum was found for nine clusters, instead of eight. This suggests 

10 that correct community identification might require larger sample sizes as datasets become 

1 1 less structured and the number of independent modules increases. 
12 

13 Defining a single correct result becomes harder for less regular partitions. For 

14 example, in Figure 3b two or three clusters could be identified. While the partition into two 

15 modules was easily identified, that into three modules resulted in a local maximum instead of 

16 a minimum. This maximum disappeared when the correlation between variables in the large 

17 module in the right of the diagram increased (Fig. 3c), which, not unexpectedly, suggests that 

18 community detection is easier when edges within these communities are strong. In any case, 

19 the low criterion values for two and three clusters seen in Figure 3c would not be 

20 unambiguous evidence of hierarchical clustering, because the criterion values neighbouring a 

21 minimum can be also low in non-hierarchical situations (see for example Fig. lh and Fig. Ik). 
22 

23 Figures 3e and 3f show many consecutive low values for the variance criterion. This 

24 could be in relation with the fact that many partitions are possible in these cases. For 

25 example, partitions into four, five, six or eight clusters could be possible in Figure 3e. 
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1 However, this could not explain all results. The criterion values remained low beyond eight, 

2 the last "correct" number of clusters. In any case, a comparison of Figures 2 and 3 suggests 

3 that profiles showing several points of inflexion could be indicators of hierarchical modular 

4 structures. 
5 

6 I made multi-sample simulations to compare the proposed procedure with that of 

7 Blondel et al. (for the latter I used the R CRAN package igraph [29]). Neither procedure ever 

8 failed to identify two modules for sample sizes of one hundred and moderate correlations of 

9 0.375 (Fig. 4 2C3). The Blondel et al. procedure was somewhat better than that proposed here 

10 when the correlation was reduced to 0.167 (Fig. 4 2C1). However, it was clearly worse in the 

11 case of four modules, as it failed to find four clusters as the most frequent result when the 

12 correlation was 0.375 (Fig. 4 4C3) and completely failed to detect them when the correlation 

13 was 0.167 (Fig. 4 4C1). In the same situations, the right solution of four clusters was that 

14 most frequently found by the proposed procedure. 
15 

16 In the hierarchical cases, the Blondel et al. procedure found only two clusters in an 

17 overwhelming majority of replicates, whereas the proposed procedure found two and four 

18 clusters as the most frequent solutions. For individual replicates, the proposed procedure 

19 would detect hierarchical situations as multiple minima for the variance criterion, as in Figure 

20 3a. The proposed procedure was able to detect the hierarchical structure in most replicates 

21 when the correlation was moderate (Fig. 4 2/2C3, in italics), but only in a minority when the 

22 correlation was low (Fig. 4 2/2C1). 
23 

24 The proposed procedure was better than the Blondel et al. procedure in most 

25 simulations made here, especially for the cases involving the smallest module sizes. The low 
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1 performance of the Blondel et al. procedure was likely related to the modularity-based 

2 methods' "resolution limit in community detection". This limit is most likely to occur when 

3 the number of module internal links is of the order of the square root of twice the total 

4 number of links in the network or smaller (Fortunato and Barthelemy 2007), i.e., when 

5 modules are small. The proposed procedure seems to be unaffected by that limit, since it can 

6 easily detect two-variables modules (see Fig. lb). However, the use of bootstrap resampling 

7 by this procedure might command too many computational resources to be practical for the 

8 analysis of large sets of variables, as those in genome-wide or human social networks. Thus, 

9 the proposed procedure could be a useful complement for low computational complexity, 

10 large-scale procedures such as that of Blondel's et al., especially when small modules are 

1 1 involved. The two procedures would not be equivalent for any problem size, however, 

12 because they do not use the same kind of information. Instead of starting with a previously 

13 known set of edge weights, the procedure proposed here simultaneously estimates both the 

14 weights and the community structure. The approach of measuring the consistency of a found 

15 community structure was already proposed by Duch and Arenas [21]. They used an extremal 

16 optimization algorithm that could result in different network partitions in different runs, so 

17 that they could calculate the fraction of times a pair of nodes was allocated to the same 

18 module. However, they did not use consistency as a criterion to identify the optimum 

19 community structure among a set of possible structures, as done here. 
20 

21 Figure 2 considers only from 2 to n-1 as possible cluster numbers. This is because 

22 considering coincidences in module allocation (and it could be argued that the very idea of 

23 clustering) does not make sense when there are n clusters of size one (and therefore no 

24 coincidences) or there is a single cluster including all variables (total coincidence). However, 

25 because the proposed procedure compares the obtained results with those expected under the 



9 



Downloaded from http://biorxiv.org/ on September 18, 2014 

1 absence of community structure, it is possible to detect this absence: if none of the 2 to n-1 

2 partitions were below the lower 2.5 percentile of the null distribution. In such a case, it would 

3 be concluded that there is no community structure. Thus, the proposed procedure does not 

4 only provide an estimate of the number of modules, but also of the reliability of that estimate 

5 and of the overall degree of structure in the data. It also makes it possible to compare the 

6 reliability of alternative solutions. 
7 

8 It must be noted that, while being able to detect some hierarchical modular structures, 

9 the proposed procedure does not provide a formal diagnostic for such structures. As seen in 

10 the results section, some of the variance criterion results could correspond both to 

1 1 hierarchical and no-hierarchical structures. These structures are more easily detected when 

12 the hierarchy is regular, in the sense that variables groups are composed of the same number 

13 of subgroups, of the same size and the same correlation between variables. This tends to 

14 result in separate minima for the variance criterion, which is characteristic of hierarchical 

15 structures. 
16 

17 The present formulation of the proposed procedure uses correlations as distance 

18 measures between variables and K-means as the clustering algorithm, but its approach of 

19 evaluating alternative partitionings based on measuring its consistency in the face of 

20 resampling would be compatible in principle with any combination of distance definitions 

21 and clustering algorithms. 
22 

23 Conclusions 
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1 The proposed procedure could be a useful tool for the analysis of networks of small to 

2 moderate size, making it possible to get an unsupervised estimate of the number of clusters 

3 present. 
4 

5 
6 

7 Availability and requirements 

8 BoCluSt is available as an R function in Sourceforge: 

9 http://sourceforge.net/projects/boclust/files/BoCluSt.txt/download 
10 

11 
12 
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15 
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9 

10 Figure 1. Community structure detection in a four variables (A to D) example. Cluster 

1 1 analyses considering two (up) and three clusters (down) are applied to four resamples (Rs) 

12 drawn from the same original data set. In each resample, each pair of variables is given a 

13 value of one if the two variables were allocated to the same cluster and zero otherwise. When 

14 the number of clusters considered in the analysis corresponds to the true community structure 

15 in the data (the two clusters case in this example), variables tend to be allocated to the same 

16 clusters in different resamples (Co - cluster composition). Thus, the above 0/1 values 

17 recording the presence/absence of coincidences have low variances (in cursive) when 

18 calculated within pairs of variables (columns in the arrays). Variances increase when other 

19 cluster numbers are considered (three in this example) because allocations become less stable 

20 under resampling. The sum of c variances (on a grey background) will be minimum for the 

21 number of clusters best fitting the true community structure. 
22 

23 Figure 2. Values for the variance criterion (circles) in the cases listed in Table 1 (a 

24 single, randomly taken simulated data set per case, with 100 randomized null data sets and 

25 500 bootstrap resamples per data set), along with the lower 2.5 percentile (simple lines) for 
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1 the corresponding null situation of no correlation between variables. Grey circles mark the 

2 value for the true number of modules in the original sample. 
3 

4 Figure 3. Analysis of hierarchical communities (a single, randomly taken simulated 

5 data set per case, with 100 randomized null data sets and 500 bootstrap resamples per data 

6 set). The graphs (left) show the variance criterion for all possible numbers of clusters along 

7 with the lower 2.5 percentile for the corresponding null situation of no correlation between 

8 variables (simple lines). Grey circles mark correct clustering results for regular partitions. 

9 The diagrams to the right represent the different situations. The grayscale indicate the value 

10 of the correlation between variables (triangles) in the same ellipse. These were 0.273 and 

11 0.545 in the eight variables cases (a - c), and 0.214, 0.429 and 0.643 in the 16 variables cases 

12 (d to f). 
13 

14 Figure 4. Frequency of numbers of clusters detected in computer simulations (1000 

15 replicates, eight variables). 2C and 4C are cases with two and four variable modules 

16 respectively, and 2/2C, hierarchical situations of two modules each divided in two sub- 

17 modules. The numbers to the right of the Cs mark the variance of components c, common to 

18 variables in the same module or sub-module: 1, variance = 0.01; 3, variance = 0.03 (the 

19 variance of component e was = 0.05 in all cases). Circles and squares show results for the 

20 proposed and the Blondel et al. procedures respectively. Grey symbols mark the value for the 

21 true number(s) of clusters. Results for one cluster are given in the x axe to ease the graphs' 

22 interpretation. This could never be the number of clusters detected because the results for the 

23 simulated and randomized cases are always identical in that case: all variables in the same 

24 and only cluster. The smaller font values are the numbers of replicates finding correct and 

25 significant results. In the hierarchical cases, this is two significant minima in the variance 
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1 criterion for two and four clusters with the proposed procedure. In italics, number of 

2 replicates finding the same two minima, whether significant or not. 
3 

4 
5 

6 Table 1. Cases considered in Figure 2. 

7 These differed in number and distribution of variables, number and sizes of modules (there 

8 were as many modules as sizes listed), sample sizes, variables distributions and variance of 

9 the components c common to variables in the same correlated group. In all cases, every 

10 variable included an independent component e with variance = 0.050. The correlations 

1 1 corresponding to the three considered variances of c were 0.375, 0.231 and 0.167. * c was 

12 generated as a beta variable with parameters a = 0.246 and [3 = 2, and e (see main text) as a 

13 beta variable with a = 0.625 and [3 = 2,using R function rbeta; the resulting x, c and e 

14 distributions were markedly asymmetric. ** c was generated as a uniform variable with range 

15 0 to 0.600 and e as a uniform variable with range 0 to 0.775 using R function runif. 



Case 


Variables 


Modules 


Sample size 


Components 


y(c) 




number 


sizes 




distribution 




a 


8 


4,4 


100 


Normal 


0.030 


b 


8 


2, 2, 2, 2 


100 


Normal 


0.030 


c 


8 


4,4 


100 


Normal 


0.010 


d 


8 


4,4 


100 


Normal 


0.015 


e 


8 


4,4 


25 


Normal 


0.030 


f 


8 


4,4 


50 


Normal 


0.030 


g 


8 


1,1,1,1,1,1,2 


100 


Normal 


0.030 


h 


8 


5,2, 1 


100 


Normal 


0.030 
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i 


8 


4,4 


100 


Beta 


0.030* 


j 


8 


4,4 


100 


Uniform 


0.030** 


k 


4 


2,2 


100 


Normal 


0.030 


1 


16 


8, 8 


100 


Normal 


0.030 
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